Recommender Systems are a set of methods able to predict the 'rating' or 'preference' that a user would give to an item. Among the different approaches to design this kind of systems, in this lab session we are going to work with Collaborative Filtering (CF) approaches, in particular, we will implement a user based CF model and the Alternative Least Square (ALS) approach.
Along the notebook we are going to use the dataset from MovieLens
. MovieLens data sets were collected by the GroupLens Research Project at the University of Minnesota. The original version of this problem contains 10 millions of ratings applied to 10681 movies by 71567 users. However, for this lab, we will use a reduced version consisting of 100,000 ratings (with values from 1 to 5) from 943 users on 1682 movies, where each user has rated, at least, 20 movies.
As you progress in this notebook, you will have to complete some exercises. Each exercise includes an explanation of what is expected, followed by code cells where one or several lines will contain <FILL IN>
. The cell that needs to be modified will have # TODO: Replace <FILL IN> with appropriate code
on its first line. Once the <FILL IN>
sections are updated, the code can be run; below this cell, you will find the test cell (beginning with the line # TEST CELL
) and you can run it to verify the correctness of your solution.
In [1]:
# Import some libraries
import numpy as np
import math
from test_helper import Test
from scipy import sparse
In [2]:
# Define data file
ratingsFilename = 'u.data'
f = open(ratingsFilename)
rawRatings = f.readlines()
# Check file format
print rawRatings[:10]
As you can checked, each line is formatted as: UserID \t MovieID \t Rating \t Timestamp \n
So, let's convert each line to a list with the fields [UserID, MovieID, Rating] (we drop the timestamp because we do not need it for this exercise). Note that the result of this conversion will provide a list of lists.
Tip: Check the Python function split()
to convert each line into a list of items splitted by a given character.
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
formatRatings = # FILL IN
print formatRatings[:10]
In [4]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(formatRatings[0], ['196', '242', '3'], 'incorrect result: data are incorrectly formatted')
Now, let's convert the above list of list into a data matrix. Due to the fact that each user has only rated a few items, we will have a lot of missing positions in this matrix. So, to store this information efficiently, we should use a sparse matrix format.
Scipy library includes a class to create different sparse matrices sparse matrices
. Given the data format that we are working, we will start creating a COOrdinate format (coo) matrix. This matrix can be created with the following command:
coo_matrix((data, (i, j)), [shape=(M, N)])
where data contains the entries of the matrix (our ratings), i is the row indexes of the matrix entries (the user ids who have generated the ratings) and j the column indexes of the matrix entries (the item ids that have been rated). Furthermore, it is advisable to indicate the matrix shape, then, values M and N indicates, respectively, the number of rows and columns (number of users and items) of the matrix.
Complete the cell below, following the indications, to construct the sparse matrix.
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
# 1. Extract from the formatted data the list of users ids, item ids and the ratings.
# Check zip(*) function to unzip the formatRatings variable
[user, item, rating] = # <FILL IN>
# 2. Convert the elements of each list to integer values and build an numpy array with the resulting list
# Due to user, movie and ratign are python lists, you will have to apply the int() operator element by element,
# check the list comprehension syntaxis to speed up this
user = # <FILL IN>
item = # <FILL IN>
rating = # <FILL IN>
# 3. Calculate the number of users, item and ratings
number_users = # <FILL IN>
number_items = # <FILL IN>
number_ratings = # <FILL IN>
print number_users
print number_items
print number_ratings
# 4. Build the coo matrix (take into account that user and item ids start from 1 and python indexing starts in zero)
rating_matrix = coo_matrix((rating, (user-1,item-1)), shape =(number_users, number_movies)).tocsr()
In [6]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals((np.sum(user), np.sum(item) , np.sum(rating)), (46248475, 42553013, 352986), 'incorrect result: user, item or rating is incorrect')
Test.assertEquals(np.round(rating_matrix.mean(),2), 0.22, 'incorrect result: user, item or rating is incorrect')
Finally, let's transform the rating matrix into a Compressed Sparse Row (CSR) format
, since we will have to acces to the users values (read rows), so this format facilitates this type of access.
In [7]:
rating_matrix = rating_matrix.tocsr()
Next cell lets you visualize the values of (a reduced version of) the matrix as an image where each rating is depicted as a little square. Blue represents the absence of a rating and the yellow-red levels represent the rating value. The plot confirms that the matrix is sparse (most of the squares are blue).
You can adapt the code to visualize a larger fraction of the matrix (although, you should take care of the matrix size that you consider and your memory space).
In [8]:
%matplotlib inline
from matplotlib import pyplot as plt
imagedata = rating_matrix[:200, :200].toarray()
# Review the method .toarray(), we will use it later
plt.imshow(imagedata, interpolation='nearest')
Out[8]:
The above image also shows that:
Review the method .getnnz([axis]) of crs matrices to compute the number of stored values (nonzeros) by rows (axis=1) or columns (axis=0).
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
# 1. Compute the number items of rated by each user
n_rat_user = # <FILL IN>
plt.figure(1, figsize=(4, 4))
hist_user = plt.hist(n_rat_user, histtype='bar', rwidth=0.8)
# 2. Compute the number ratings for each item
n_rat_item = # <FILL IN>
plt.figure(2, figsize=(4, 4))
hist_item = plt.hist(n_rat_item, histtype='bar', rwidth=0.8)
In [10]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(hist_user[0][0], 560, 'incorrect result: histogram over the number of user ratings is incorrect')
Test.assertEquals(hist_item[0][0], 1146, 'incorrect result: histogram over the number of ratings by item is incorrect')
Now, to be able to train and evaluate the different methods, let's divide the rating matrix into two different matrix:
Hint: you may need the following functions:
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
# From the reduced rating matrix extract the users, items and ratings. Use the sparse.find() method
[users, items, ratings] = #<FILL IN>
# Get the number of ratings
number_ratings = #<FILL IN>
# Compute the number of training ratings as the 75% of the total number of ratings
np.random.seed(0)
n_tr = #<FILL IN>
# Create a permuted range from 0 to the number of ratings
random_pos = #<FILL IN>
# Select the first n_tr positions of random_pos as the training positions,
# and the remaining random_pos indexes and the positions of the testing ratings
pos_tr = #<FILL IN>
pos_test = #<FILL IN>
# Build the training and testing rating matrices
# Create a coo_matrix and, then, convert it to csr format
# Note that the number of users and items has not changed
rating_train = #<FILL IN>
rating_test = #<FILL IN>
In [12]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(n_tr, 75000, 'incorrect result: n_tr value is incorrect')
Test.assertEquals(np.sum(random_pos[:5]), 142525, 'incorrect result: random_pos values are incorrect')
Test.assertEquals(np.round(rating_train.mean(),2), 0.17, 'incorrect result: the values of the training rating matrix are incorrect')
Test.assertEquals(np.round(rating_test.mean(),2), 0.06, 'incorrect result: the values of the testing rating matrix are incorrect')
To speed up the evaluations of the recommender systems, let's create a reduced set with 50 testing users. Of course, you can later extend the method evaluation to all the users, although the evaluation of some methods can take several minutes.
In [13]:
np.random.seed(0)
all_users = np.random.permutation(number_users)
test_users = all_users[:50]
In this section we are going to build two baseline approaches. In particular, let's consider:
So, complete the following cells to implement the functions corresponding to these two approaches.
Hint: When you are working with a csr matrix R, you can get the i-th row of the matrix as R[i,:]
. And, if you need their non-zero values, you can use its attribute .data to extract them; i.e., R[i,:].data
will return the non-zero values of the i-th of a crs matrix.
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
def mean_recomender(user_id, item_ids, ratings):
""" Calculate the output of a mean based recommender
Args:
user_id: id of the user to compute its output
item_ids: numpy array with the ids of the items whose rating has to be estimated
ratings: crs matrix with the ratings of all the users to the rated items
Returns:
pred_items: numpy array of dimensions 1 x number of items with the computed predictions for each item.
This prediction is computed as the mean value of the items which the user_id has already rated.
"""
# Compute the number of items in the rating matrix
n_items = #<FILL IN>
# Select the values of the items rated by the user_id
ratings_u = #<FILL IN>
# Compute the mean value of the ratings_u
if ratings_u.shape[0]>0:
mean_rating = #<FILL IN>
else: # Cold start problem (average rating)
mean_rating = 3
# Generate a numpy vector of dimensions 1 x n_items with all their values equal to mean_rating
pred_items = #<FILL IN>
return pred_items
In [15]:
# Testing function mean_recomender()
user_id = 20
item_ids = np.array([8, 0, 100])
pred_mean = mean_recomender(user_id, item_ids, rating_train)
print pred_mean
In [16]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(pred_mean.shape, (1,3), 'incorrect result: pred_mean shape is incorrect')
Test.assertEquals(np.round(np.sum(pred_mean),2), 8.3, 'incorrect result: pred_mean values are incorrect')
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
def mode_recomender(user_id, item_ids, ratings):
""" Calculate the output of a mode based recommender
Args:
user_id: id of the user to compute its output
item_ids: numpy array with the ids of the items whose rating has to be estimated
ratings: crs matrix with the ratings of all the users to the rated items
Returns:
pred_items: numpy array of dimensions 1 x number of items with the computed predictions for each item.
This prediction is computed as the mode value of the items which the user_id has already rated.
"""
# Compute the number of items in the rating matrix
n_items = #<FILL IN>
# Select the values of the items rated by the user_id
ratings_u = #<FILL IN>
# Compute the mean value of the ratings_u
if ratings_u.shape[0]>0:
mode_rating = #<FILL IN>
else: # Cold start problem (average rating)
mode_rating = 3
# Generate a numpy vector of dimensions 1 x n_items with all their values equal to mean_rating
pred_items = #<FILL IN>
return pred_items
In [18]:
# Testing function mode_recomender()
user_id = 20
item_ids = np.array([8, 0, 100])
pred_mode = mode_recomender(user_id, item_ids, rating_train)
print pred_mode
In [19]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(pred_mode.shape, (1,3), 'incorrect result: pred_mean shape is incorrect')
Test.assertEquals(np.round(np.sum(pred_mode),2), 9, 'incorrect result: pred_mean values are incorrect')
To evaluate the performance of the different recommenders, we are going to use two measurements:
Thus, complete the following cells two build two functions able to compute the MAE and RMSE values.
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
def get_MAE(pred_rating, real_rating):
""" Calculate the MAE
Args:
pred_rating: crs matrix, with dimensions n_users x n_items, with the predicted ratings.
real_rating: crs matrix, with dimensions n_users x n_items, with the real ratings.
Returns:
MAE: Mean Absolute Error computed over the non-zero entries of real_rating.
"""
# Extract the non-zero positions of real_rating and their values (use sparse.find() method)
[pos_users, pos_items, real_values] = # <FILL IN>
# Extract the predicted values of the non-zero positions
pred_values = # <FILL IN>
# Compute the MAE (check np.absolute method)
MAE = # <FILL IN>
return MAE
In [21]:
###########################################################
# TEST CELL
###########################################################
matrix_1 = sparse.eye(10).tocsr()
matrix_2 = (1.2*sparse.eye(10)).tocsr()
matrix_2[0,0]= 0.4
Test.assertEquals(np.round(get_MAE(matrix_1, matrix_2),2), 0.24, 'incorrect result: MAE value is incorrect')
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
def get_RMSE(pred_rating, real_rating):
""" Calculate the RMSE
Args:
pred_rating: crs matrix, with dimensions n_users x n_items, with the predicted ratings.
real_rating: crs matrix, with dimensions n_users x n_items, with the real ratings.
Returns:
RMSE: Root Mean Square Error computed over the non-zero entries of real_rating.
"""
# Extract the non-zero positions of real_rating and their values (use sparse.find() method)
[pos_users, pos_items, real_values] = # <FILL IN>
# Extract the predicted values of the non-zero positions
pred_values = # <FILL IN>
# Compute the RMSE (check np.sqrt and np.square methods)
RMSE = # <FILL IN>
return RMSE
In [23]:
###########################################################
# TEST CELL
###########################################################
matrix_1 = sparse.eye(10).tocsr()
matrix_2 = (1.2*sparse.eye(10)).tocsr()
matrix_2[0,0]= 0.4
Test.assertEquals(np.round(get_RMSE(matrix_1, matrix_2),2), 0.27, 'incorrect result: RMSE value is incorrect')
Now, let's evaluate the performance of the mean and mode based baselines.
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
# Compute the number of users and items
n_users, n_items = # <FILL IN>
# Create two empty prediction matrix in crs format
pred_mean_ratings = sparse.lil_matrix((n_users, n_items))
pred_mode_ratings = sparse.lil_matrix((n_users, n_items))
# Work user by user
for u in range(n_users):
# Get, form the test matrix, the item id to be predicted for this user (check .indices attribute of crs matrix)
item_ids = # <FILL IN>
# Get predictions with the mean based baseline for user u
pred_mean_u = # <FILL IN>
# Get predictions with the mode based baseline for user u
pred_mode_u = # <FILL IN>
# Build the prediction matrices
pred_mean_ratings[u,item_ids] = # <FILL IN>
pred_mode_ratings[u,item_ids] = # <FILL IN>
#Compute the error (MAE and RMSE) for each baseline method over the test_users
MAE_mean = # <FILL IN>
RMSE_mean = # <FILL IN>
MAE_mode = # <FILL IN>
RMSE_mode = # <FILL IN>
print 'Mean model ... MAE: %2.2f , RMSE: %2.2f ' % (MAE_mean, RMSE_mean)
print 'Mode model ... MAE: %2.2f , RMSE: %2.2f ' % (MAE_mode, RMSE_mode)
In [25]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(np.round(MAE_mean,2), 0.84, 'incorrect result: MAE value of mean recommeder is incorrect')
Test.assertEquals(np.round(RMSE_mean,2), 1.04, 'incorrect result: RMSE value of mean recommeder is incorrect')
Test.assertEquals(np.round(MAE_mode,2), 0.86, 'incorrect result: MAE value of mode recommeder is incorrect')
Test.assertEquals(np.round(RMSE_mode,2), 1.19, 'incorrect result: RMSE value of mode recommeder is incorrect')
In this section let's build a user-based collaborative filter by finding users who are similar to each other and user their rated items to predict the ratings over item which haven't been rated by other users.
The general algorithm will be (in pseudocode) as follows:
The similarity between different users is determined by comparing each user with every other user and computing a similarity score. Here, we are going to use the Pearson correlation coefficient: $$ sim(user_a, user_b) = \frac{\sum_{p \in P} (r_{a,p} -\bar{r}_a)(r_{b,p} -\bar{r}_b)} {\sqrt{ \sum_{p \in P} (r_{a,p} -\bar{r}_a)^2} \sqrt{ \sum_{p \in P} (r_{b,p} -\bar{r}_b)^2}}$$
where $P$ is set of items rated for both users a and b, $r_{u,p}$ is the rating of the user u to item p, and $\bar{r}_u$ is the mean value of the all the ratings of the user u.
Complete the next code so that it is able to obtain the correlation coefficient between to given users.
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
def compute_Pearson_correlation(ratings, id_u1, id_u2):
""" Calculate correlation coefficient
Args:
ratings: crs matrix, with dimensions n_users x n_items, with the ratings used to measure similarities.
id_u1: id user 1
id_u2: id user 2
Returns:
corr_value: correlation coefficient
"""
# Get the indexes and values of the items rated by user 1 (use sparse.find() function)
[pos_u1, items_u1, values_u1] = # <FILL IN>
# Get the indexes and values of the items rated by user 2 (use sparse.find() function)
[pos_u2, items_u2, values_u2] = # <FILL IN>
# Get the set of items rated by both users (you can use np.intersect1d() method)
items_intersect = # <FILL IN>
if items_intersect is not None: # If the are common rated items...
# Compute the mean values of all the items rated by user 1 and user 2
m_1 = # <FILL IN>
m_2 = # <FILL IN>
# Get the ratings of users 1 and 2 in items_intersect (you can use .toarray() method)
r_u1 = # <FILL IN>
r_u2 = # <FILL IN>
# Remove their means
r_u1 = # <FILL IN>
r_u2 = # <FILL IN>
# Compute the correlation coefficient
corr_value = # <FILL IN>
# Remove useless dimensions
corr_value =np.squeeze(corr_value)
else: # Else correlation is 0
corr_value = 0
# Checking that the correlation is not NaN (this would happen if the denominatior is 0),
# in this case, set the corrlation coefficient to 0
if math.isnan(corr_value):
corr_value = 0
return corr_value
In [43]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(np.round(compute_Pearson_correlation(rating_train, 5, 12),2), 0.36, 'incorrect result: correlation value is incorrect')
Once you know how a user is similar to other users, you would now like to know the movies that should be recommended for this user. So, let's assign a rating to each item by averaging the ratings that the similar users have given to that item according to this expression: $$ pred(user_a, item_i) = \bar{r_a} + \frac{\sum_{b \in N} sim(user_a, user_b) * (r_{b,i}- \bar{r_b})}{\sum_{b \in N} sim(user_a, user_b)}$$ where N is the number of neighbors of user a ($sim >0$) which have rated item i.
Thus, let's create a function, that using the correlation coefficient as similarity value, compute for a given user the ratings over some unseen items.
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
def user_sim_recommender(user_id, item_ids, ratings):
""" Compute the recomendations for user_id over the item_ids with a user based collaborative filtering approach
Args:
user_id: id of the user to compute its output
item_ids: numpy array with the ids of the items whose rating has to be estimated
ratings: crs matrix with the ratings of all the users to the rated items
Returns:
pred_items: numpy array of dimensions 1 x number of items with the computed predictions for each item.
"""
# Get number of users
n_users = #<FILL IN>
# Get the number of items in items_id
n_items = #<FILL IN>
# Create variables to save incremental versions of numerator and denominator
rating_w_acc = np.zeros((1,n_items)) # Numerator (for each item there is a value)
sim_acc = 0 # Denominator
# Build a reduced matrix of ratings with only the columns corresponding to item_ids
ratings_items = #<FILL IN>
# Now we move user by user and compute the corresponding term of the numerator and denominator
for id_u in range(n_users):
# Compute the similarity of user_id with id_u
sim = #<FILL IN>
# If there is similarity ...
if sim>0:
# Get items rated by id_u, among item_ids, and their values
# (use sparse.find() over the row id_u of ratings_items )
[idx_users, pos_ratings_u, ratings_u] = #<FILL IN>
# If id_u has rated items among items_id ...
if pos_ratings_u.shape[0]>0:
# Get the mean value of all the items rated by id_u
mean_id_u = #<FILL IN>
# Update numerator (add term sim*(ratings_u-mean_id_u))
rating_w_acc[:,pos_ratings_u] = rating_w_acc[:,pos_ratings_u] + #<FILL IN>
# Update denominator (add sim)
sim_acc = sim_acc + #<FILL IN>
# Now, that all the terms of numerator and denominator are computed, calculate the predicted values
# 1. Get the mean value of all the items rated by user_id
mean_id_user = #<FILL IN>
# 2. Get predictions
# If this user has similar users (sim_acc>0)...
if sim_acc >0:
# Get predictions with general expresion
pred_items = #<FILL IN>
else: # else (cold start problem)...
# Give predictions as mean value (mean_id_user)
pred_items = #<FILL IN>
return pred_items
In [45]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(np.round(np.sum(user_sim_recommender(20, np.array([2, 5, 8]), rating_train))), 9, 'incorrect result: correlation value is incorrect')
Now, let's evaluate the performance of this recommender over all the users
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
print 'Please, be patient, this computation takes a while... '
# Compute the number of users and items
n_users, n_items = # <FILL IN>
# Create an empty prediction matrix in crs format
pred_ratings = sparse.lil_matrix((n_users, n_items))
# Work user by user
for u in test_users:
# Get, form the test matrix, the item id to be predicted for this user (check .indices attribute of crs matrix)
item_ids = # <FILL IN>
# Get predictions with the used based CF method for user u
pred_u = # <FILL IN>
# Build the prediction matrix
pred_ratings[u,item_ids] = # <FILL IN>
# Compute the error (MAE and RMSE) over test_users
MAE = # <FILL IN>
RMSE = # <FILL IN>
print 'MAE: %2.2f , RMSE: %2.2f ' %(MAE, RMSE)
In [47]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(np.round(MAE,2), 0.82, 'incorrect result: MAE value is incorrect')
Test.assertEquals(np.round(RMSE,2), 1.02, 'incorrect result: RMSE value is incorrect')
Advance work: you can try to modify the function compute_Pearson_correlation() to force than a minimum number of common items have been rated (for instance, more than 4 items) to consider that two users can be similar. Check again the MAE and RMSE, have they been reduced? Is the method scalability improved?
In the file 'u.item' we have the information associated to each item. So, let's start loading this file...
In [48]:
moviesFilename = 'u.item'
f = open(moviesFilename)
rawMovies = f.readlines()
print rawMovies[:5]
Each line in the dataset is formatted as:
>> MovieID | Title | Date | url | Genres codification\n
Now, let's format each line to create a list of [MovieID, Title]. The remaining variables will be dropped because we do not need them for this exercise.
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
formatMovies = #<FILL IN>
print formatMovies[:5]
In [50]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals((len(formatMovies), len(formatMovies[0])), (1682,2), 'incorrect result: formatMovies dimensions are incorrect')
Test.assertEquals(formatMovies[10], ['11', 'Seven (Se7en) (1995)'], 'incorrect result: formatMovies content is incorrect')
Once we have the movie information, let's analyze the set of ten movies that we would recommend to the user with 20. Complete the following cell, following the instructions...
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
# 1. Compute the predictions, over all the items, that the user based system would provide for the user with id 20
user_id = 20
# Define the list of items as a list with all the item ids
item_ids = #<FILL IN>
# Get the predicitions (use user_sim_recommender() function)
list_pred = #<FILL IN>
# Remove useless dimensions of list_pred
list_pred = np.squeeze(list_pred)
# 2. Sort the list of predicted ratings, placing the highest ratings at the first
pos_ord = #<FILL IN>
# 3. Print the film titles with the ten highest ratings
for i in range(10):
# Get the id of the movie sorted at position i
id_movie = #<FILL IN>
print '%d: %s with rating %2.2f' %(i+1, formatMovies[id_movie][1], list_pred[id_movie])
In [58]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(np.round(sum(list_pred[:10]),2), 37.63, 'incorrect result: list_pred is incorrect')
Test.assertEquals(sum(pos_ord[:10]), 1579, 'incorrect result: pos_ord is incorrect')
Finally, let's implement the ALS algorithm. As you, know this method tries to approximate the ratings matrix by factorizing it as the product of two matrices:
$$ R = X * Y $$where $X$ describes properties of each user, and $Y$ describes properties of each item. These two matrices are known as latent factors, since they are a low-dimension representation of users and items.
These matrices are computed such that the error for the users/items pairs over the known ratings is minimized:
$$ \min_{X, Y} \sum_{(u, i) \in P} \left( r_{u,i} - X_u Y_i \right)^2 + \lambda \left( ||X||^2 + ||Y||^2 \right) $$where P is the set known rated items.
The ALS algorithm does this by first randomly selecting the latent factors matrices $X$ and $Y$ and then, given the values of $Y$, it optimizes the values of $X$ such that the error is minimized:
$$ \min_{X} \sum_{(u, i) \in P} \left( r_{u,i} - X_u Y_i \right)^2 + \lambda ||X||^2 $$Then, by the computed users' factors ($X$), it optimizes the value of the item's matrix solving: $$ \min_{Y} \sum_{(u, i) \in P} \left( r_{u,i} - X_u Y_i \right)^2 + \lambda ||Y||^2 $$
This alternation between which matrix to optimize is the reason for the "alternating" in the name.
To solve this optimization problem, either for users or items factors, you can use the ridge regression
implementation of scikit-learn.
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
from sklearn.linear_model import Ridge
def train_ALS(ratings, lambda_, n_factors):
""" Compute the latent factors of the ALS algorithm
Args:
ratings: crs matrix with the ratings of all the users to the rated items
lambda_: regularization parameter
n_factors: number of latent factors
Returns:
X, Y: latent factor matrices of users and items
"""
# Parameters
n_iterations = 20
# Get the number of users and items
n_users, n_items = # <FILL IN>
# Random initialization of latent factors
np.random.seed(0)
X = 5 * np.random.rand(n_users, n_factors)
Y = 5 * np.random.rand(n_factors, n_items)
# Define the classifier
clf = Ridge(alpha=lambda_, fit_intercept=False, max_iter=100,tol=0.01)
for ii in range(n_iterations):
for u in range(n_users):
# From ratings matrix get the rated items by user u (use toarray() method)
# Use np.squeeze to remove useless dimensions of r_u
r_u = # <FILL IN>
# Let's create an index matrix indicating the positions where there is or there isn't a rating
w_u = # <FILL IN>
# Solve the optimization problem
# Find X_u to minimize (w_u*(r_u-X[u,:]*Y)^2)
clf.fit(Y.T, r_u.T,w_u.T)
# Get the coefficients computed by the model and add it to the latent factor matrix
X[u,:] = # <FILL IN>
for i in range(n_items):
# From ratings matrix get the rating corresponding to item i (use toarray() method)
# Use np.squeeze to remove useless dimensions of r_i
r_i = # <FILL IN>
# Let's create an index matrix indicating the positions where there is or there isn't a rating
w_i = # <FILL IN>
# Solve the optimization problem
# Find Y_i to minimize (w_i*(r_i-X*Y[i,:])^2)
clf.fit(X, r_i,w_i)
# Get the coefficients computed by the model and add it to the latent factor matrix
Y[:,i] = # <FILL IN>
# To analyze error evolution
# Get predictions (use np.dot to multiply latent factor matrices)
pred_ratings = # <FILL IN>
# Compute the error (MAE and RMSE)
MAE = # <FILL IN>
RMSE = # <FILL IN>
print 'Iteration: %d, MAE: %2.2f , RMSE: %2.2f ' % (ii, MAE, RMSE)
return X, Y
In [60]:
# Test the ALS funtion
# parameters
lambda_ = 10
n_factors = 10
# Train the ALS model
X, Y = train_ALS(rating_train, lambda_, n_factors)
In [61]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(np.round(np.mean(X),2), 0.13, 'incorrect result: X values are incorrect')
Test.assertEquals(np.round(np.mean(Y),2), 0.08, 'incorrect result: Y values are incorrect')
Now, let's compute the error over the test data
In [ ]:
###########################################################
# TODO: Replace <FILL IN> with appropriate code
###########################################################
# Get predictions (use np.dot to multiply latent factor matrices)
pred_ratings = # <FILL IN>
# Compute the error (MAE and RMSE) over test_users
MAE = # <FILL IN>
RMSE = # <FILL IN>
print 'MAE: %2.2f , RMSE: %2.2f ' % (MAE, RMSE)
In [63]:
###########################################################
# TEST CELL
###########################################################
Test.assertEquals(np.round(MAE,2), 0.78, 'incorrect result: MAE value is incorrect')
Test.assertEquals(np.round(RMSE,2), 1, 'incorrect result: RMSE value is incorrect')